Abstract 

This study compares various item option scoring methods with 
respect to coefficient alpha and a concurrent validity coefficient. 
The scoring methods under consideration were: (1) formula scoring, 
(2) a priori scoring, (3) empirical scoring with an internal criterion, 
and (4) two modifications of formula scoring. The study indicates a 
clear superiority of the empirically determined scoring system with 
respect to both coefficient alpha and the concurrent validity. 



A COMPARISON OF VARIOUS ITEM OPTION WEIGHTING SCHEMES 

Gary Echternacht 
Educational Testing Service 

One of the essential goals in measurement systems research is to 
extract as much information as possible from a given set of items. This 
allows the test constructor to use fewer items in a test, while retaining 
a previously set reliability standard. This, in turn, is especially 
desirable in the case where items are difficult and/or expensive to 
construct. The general problem of increasing the amount of information 
from an item requires examining one or all of three components: (1) how 
the examinee is to respond to the item, (2) how an item is scored, and 
(3) how the items are put together to form a total score. 

If one assumes that the multiple-choice format that now exists will 
be continued in use for some time in the future and considers that past 
research with weighting items differentially has been unfruitful 
(Stalnaker, 1938; Wilks, 1938), one concludes that the most productive 
area of research lies with investigating various scoring methods or, in 
other words, differential weighting of options of an item. There are two 
different general methods of weighting item options most often accepted. 
One involves empirically weighting options using some internal or external 
criterion, the other an a priori weighting of the options. 

Weighting by using some internal or external criterion dates back 
to the 1920's when Strong began work on his interest inventory (Strong, 
1943). This type of criterion keying, usually with an external criterion, 



This study was sponsored by the Graduate Record Examinations Board. 



was used mostly with self-report types of items. Strangely, the question 
of differentially weighting the options of achievement and ability items 
to improve reliability has received little attention. This has been true 
since: (1) the use of an external criterion with achievement and ability 
is time consuming, expensive, and somewhat prone to error in the criterion 
measure; and (2) obtaining weights with minimal sampling variance requires 
a large amount of data and much computation. 

As might be expected, a priori weighting of test items (with differing 
weights for distractors) has not been widely practiced. Gage (195?) 
and Yee and Kriewall (1969) have used a priori scores on the Minnesota 
Teacher Attitude Inventory with an effectiveness equal to that when the 
more elaborate criterion keying was used. Davis and Fifer (1959) used both 
forms of option weighting in raising the cross-validated comparable-forms 
reliability of a specially prepared arithmetic-reasoning test. Other than 
that, there appear to be few noteworthy attempts to use a priori option 
weighting. Although not generally thought of as being a priori, because 
equal weights are given to all distractors, the usual formula score, along 
with its modifications, is an a priori system. 

In contrasting empirical weighting of options and a priori weighting 
of options, weighting empirically seems to suffer from one major 
difficulty. The examinee does not know the consequence of his action when 
responding to any given item or, in other words, he does not understand 
the scoring system being used, a defect that appears to this writer to be 
somewhat unethical. One would inevitably be asked why person 
A received a score of X on a given item and person B received a score 
of Y on the same question even though both answered incorrectly. One 
would be hard pressed to answer with anything satisfactory to the 
examinee. 

It seems to this writer that the most fruitful search for a simple, 
easily understood, ethical system of differential weighting lies with 
a priori weighting. This process has some problems of its own though. 
For example, most a priori option weighting studies have utilized a panel 
of judges for supplying the weights, which introduces a further source of 
error into the weighting system and can be somewhat expensive. It would 
seem more desirable if the test item writer could specify the weights in 
some predetermined manner as he developed the various distractors. The 
work of Elizur (1970) and Guttman (1965) with facet design has provided 
an indication that this might prove to be a fruitful method for 
constructing a priori weights. Also, there is some question as to whether 
item writers can construct items using facet design, though that question 
was not investigated in this study. 

Purpose and Procedure 

The purpose of this study is to determine the effectiveness of various 
item option scoring schemes, especially empirical and a priori schemes, in 
relation to formula scoring and some of its modifications. Since the 
ultimate goal is to shorten the test and retain the same degree of 
reliability, the reliability of the test under these various scoring 
schemes becomes the prime measure of effectiveness. Thus, the reliabilities 
obtained under the different scoring schemes will be of prime importance. 

Of secondary importance (only in this instance) is the question of 
validity. In a study such as this, it was not feasible to collect any 
completely adequate criterion measure, although a similar (not parallel) 
test of greater length was thought to be useful for obtaining a concurrent 
validity statistic. Such was possible in the operational structure of this 
study, and the correlation between the experimental test score using the 
various scoring methods and the longer test served as a validity check. 
This study was conducted through the operational framework of the Graduate 
Record Examinations (GRE) program, with the experimental test embedded 
within the GRE Aptitude Test as one of the regular pretest 
sections. The items were quantitative in nature, and the longer, similar 
test mentioned above consisted of the regular GRE quantitative test section. 

In order to determine the effectiveness of the scoring system, six 
random samples were drawn from the total number of examinees taking the 
specially designed test form. Values of coefficient alpha were calculated 
for each scoring system on each sample. In addition, correlations between 
the main section quantitative test score and the special test score were 
obtained for each scoring system. 
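
The reliability statistic used throughout can be sketched in a few lines. The block below assumes the standard definition of coefficient alpha, k/(k-1) times one minus the ratio of summed item variances to total-score variance; the function name and the layout of the score matrix (one row per examinee, one column per item) are illustrative choices, not taken from the study.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a matrix of item scores.

    item_scores: list of examinee rows, each a list of k item scores
    (any scoring scheme: rights only, formula, a priori, empirical).
    """
    k = len(item_scores[0])
    # Transpose rows into per-item columns to get each item's variance.
    columns = list(zip(*item_scores))
    item_variance_sum = sum(variance(column) for column in columns)
    # Variance of the examinees' total scores.
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)
```

Because the function only sees item scores, the same call can be repeated on one sample under each scoring scheme, which is how the comparison across schemes is set up here.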

Test Construction 

A 30-item quantitative pretest section was constructed especially for 
this study. This pretest section appeared in the June 1972 administration 
of the GRE Aptitude Test. The 30 test items were written completely by the 
Educational Testing Service Test Development Division. According to 
specifications provided by the study director, they were instructed to 
construct items with one correct answer, two distractors differing from 
the correct answer in only one aspect (one error in logic or operation), 
and two distractors differing from the correct answer in more than one 
aspect. The distractors with only one error were termed "first order" 
distractors, while the remaining were termed "second order" distractors. 
The item writers kept a log of the time required to write and review the 
items, so the additional costs for writing such items could be computed. 

In the a priori scoring scheme, no attempt was made to differentiate 
between the two first order distractors. The same was true for the 
second order distractors. 

Scoring Systems under Consideration 

The usual scoring system for GRE tests is to give one point for a 
correct answer, zero for an omit, and -1/4 for an incorrect answer. Thus, 
the formula scoring system becomes a baseline system for making comparisons. 
Since it is extremely easy to construct, a rights-only scoring system was 
also used. 

The a priori scoring system was developed with the following properties 
in mind: (1) the scoring system should use integer scores; (2) the expected 
score under random guessing should be zero; and (3) the intervals between 
the scores should be equal, excluding the omit score. Thus, a scoring 
system was used that gave a score of 6 to a correct answer, a score of 
1 if a first order distractor was chosen, and a score of -4 for selection 
of a second order distractor. All omits were scored as zero. 
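
The three design properties can be checked directly. The sketch below tabulates the rights-only, formula, and a priori schemes for a five-choice item built to the study's specification (one correct answer, two first order and two second order distractors). The score of -4 for a second order distractor is the value implied by the three stated properties once 6 and 1 are fixed. All names are illustrative.

```python
from fractions import Fraction

# The five options of an item, by kind, as specified to the item writers.
OPTION_KINDS = ["correct", "first", "first", "second", "second"]

RIGHTS_ONLY = {"correct": 1, "first": 0, "second": 0, "omit": 0}
FORMULA = {"correct": Fraction(1), "first": Fraction(-1, 4),
           "second": Fraction(-1, 4), "omit": Fraction(0)}
# Second order score of -4 follows from: integers, zero expectation
# under guessing, and equal intervals (6 - 1 = 1 - (-4) = 5).
A_PRIORI = {"correct": 6, "first": 1, "second": -4, "omit": 0}


def expected_guessing_score(scheme):
    """Mean item score when each of the five options is equally likely."""
    scores = [Fraction(scheme[kind]) for kind in OPTION_KINDS]
    return sum(scores) / len(scores)


def total_score(choices, scheme):
    """Score an examinee given the kind of option chosen on each item."""
    return sum(scheme[kind] for kind in choices)
```

Under these definitions the formula and a priori schemes both have zero expectation under blind guessing, while rights-only scoring does not; for example, `total_score(["correct", "first", "second", "omit"], A_PRIORI)` gives 6 + 1 - 4 + 0.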

The procedure for obtaining empirical scores for each item option, 
including omit, was to use the keying for internal consistency procedure 
found in Reilly and Jackson (1972), which is similar to that of Hendrickson 
(1971). The computational details will not be given here, but basically 
the process consists of first scoring the test using the conventional 
scoring formula (rights - 1/4 wrongs); secondly, assigning the weight 
determined by the mean standard score on the remaining items for all 
persons choosing that option; and finally, computing coefficient alpha. The 
procedure can be used iteratively until coefficient alpha appears to 
stabilize, although Reilly (personal communication) notes that such 
iterations fail to change coefficient alpha by any sizable amount, at 
least when the test is already fairly reliable to begin with. 
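
One pass of this keying can be sketched as follows. This is only an interpretation of the steps as summarized above, not the Reilly and Jackson procedure itself, whose computational details are not given here. It assumes five-choice items, uses `None` as a hypothetical marker for an omit, and standardizes the rest score with the population standard deviation.

```python
from statistics import mean, pstdev

OMIT = None  # hypothetical marker for an omitted item


def formula_item_score(choice, correct, n_options=5):
    """Conventional score: 1 if right, -1/(a-1) if wrong, 0 if omitted."""
    if choice is OMIT:
        return 0.0
    return 1.0 if choice == correct else -1.0 / (n_options - 1)


def empirical_option_weights(responses, key, n_options=5):
    """responses: list of examinee rows of chosen options; key: correct
    option per item. Returns one dict per item mapping each observed
    choice (including OMIT) to its weight."""
    base = [[formula_item_score(c, k, n_options) for c, k in zip(row, key)]
            for row in responses]
    totals = [sum(row) for row in base]
    weights = []
    for j in range(len(key)):
        # Score on the remaining items, standardized over all examinees.
        rest = [total - row[j] for total, row in zip(totals, base)]
        m, s = mean(rest), pstdev(rest)
        z = [(r - m) / s for r in rest]
        # Weight of an option = mean standard score of those choosing it.
        by_choice = {}
        for row, z_value in zip(responses, z):
            by_choice.setdefault(row[j], []).append(z_value)
        weights.append({c: mean(values) for c, values in by_choice.items()})
    return weights


def rescore(responses, weights):
    """Total score per examinee under the empirical option weights."""
    return [sum(weights[j][c] for j, c in enumerate(row)) for row in responses]
```

In use, the weights would be estimated on a base sample and then applied unchanged to later samples; an iterative version would replace the formula scores with the rescored items and repeat until coefficient alpha stops changing.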

Although the expressed purpose of this study was to compare a priori, 
empirical, and formula scoring methods, it was relatively easy to add two 
others that were modifications of formula scoring. The motivation for 
including these scoring systems in this study was that they had recently 
appeared in the literature, and there were no empirical results where 
these systems were used. It was also very inexpensive to incorporate 
these systems into the design. The two systems were recently developed 
by Zinger (1972) and are termed Z1 and Z2. The Z1 scoring system 
gives a score of one to a correct answer and a score of -c to an 
incorrect answer. The value c is determined by 

$$ c = \sum_{i=1}^{a-1} n_i^2 \Big/ \Big( \sum_{i=1}^{a-1} n_i \Big)^2 \tag{1} $$

where $a$ indicates the number of alternatives, $n_i$ the number of examinees 
responding to the ith distractor, and the summations are taken over the 
distractors. 
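
A small numerical sketch may help in reading Equation 1. It assumes the equation is the sum of squared distractor counts divided by the square of their sum, a reading chosen because it reduces to the usual formula-scoring penalty of 1/(a-1) when wrong answers are spread evenly. The function names and the counts in the usage note are hypothetical.

```python
def z1_penalty(distractor_counts):
    """Penalty c for a wrong answer under Z1 (assumed reading of Eq. 1).

    distractor_counts: number of examinees choosing each of the a-1
    distractors of one item in the base sample.
    """
    total = sum(distractor_counts)
    return sum(n * n for n in distractor_counts) / (total * total)


def z1_item_score(is_omit, is_correct, c):
    """Z1 item score: 1 if right, -c if wrong, 0 if omitted."""
    if is_omit:
        return 0.0
    return 1.0 if is_correct else -c
```

With four distractors chosen equally often, for instance counts of 50 each, the penalty equals 1/4, the conventional correction. Concentrating the wrong answers on fewer distractors raises the penalty toward 1, which is the sense in which wrong answers are penalized more heavily as their distribution becomes less uniform.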

The Z2 system gives a score of 1 + b to a correct answer, -c 
if the answer is incorrect. The value b is determined by 

$$ b = \sum_{i=1}^{a-1} (n_i - \bar{n})^2 \Big/ \Big( n_a \sum_{i=1}^{a-1} n_i \Big) \tag{2} $$

where $n_a$ indicates the number answering correctly and $\bar{n}$ the mean 
number of examinees per distractor. 

These two scoring systems are based on the concept of "ideal" items as 
presented by Weitzman (1970) and provide a correction for guessing that 
takes into account a nonuniform distribution of wrong answers, in contrast 
to the uniform distribution assumed by formula scoring. In essence, the 
distractors are weighted more negatively as the distribution becomes more 
nonuniform. In both these systems omitted items were given a score of zero. 

For the empirical weights and the Z1 and Z2 weights an initial 
sample is needed for calculation of the item option weights. The weights 
thus obtained are then used in each subsequent scoring replication. 

Sampling 

As stated previously, the pretest section was spiralled and thus was 
taken at most test centers across the country. Since seven independent 
samples were required, it was decided to use a two-stage process in making 
sample selections. The first stage consisted of selecting test centers, 
while the second stage consisted of selecting students within a test center. 
Actually, a test center represents a fairly good primary sampling unit as 
the students in these centers tend to be somewhat homogeneous with respect 
to undergraduate institution and geographic region. 

Further, it was decided to balance the sample with respect to ability 
level as measured by the number correct on the regular quantitative test 
section. Thus, cutting scores were developed, using the entire sample, 
for classifying any individual into the lower, middle, or upper third in 
quantitative ability as measured by the number correct on the regular 
section. Also, it was of interest to have some samples completely female 
and others completely male in makeup. A two by three table (sex X ability 
level) was conceptualized for further selecting test centers. 

A count of the number of test centers having X individuals in 
each cell of the table described above was made. The number of test 
centers was further broken down by geographic region (Census Bureau 
classification) for each value of X. X varied from one until it was 
so large that no test center had at least X in each cell. 

The first sample was designed to be a base sample for calculating 
the various weights involved in the empirical scoring and the Z1 and 
Z2 systems. This sample had to be at least 2,500 in number since it was 
desired to have the standard deviations of the estimated proportion of 
people responding to a particular alternative be less than .01. It was 
also desirable to select as many centers as possible to make up this 
sample in order to represent as wide a range as possible. Therefore, all 
centers having at least two candidates per cell were selected for the base 
sample or sample 1. There were 231 such centers. Within each center, 
candidates were classified into the six cells and two candidates were 
selected from each cell using simple random sampling. The resulting sample 
size for the base sample was 2,772. 
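
The arithmetic behind the base sample can be verified in a short sketch. It treats the proportion choosing an alternative as a simple binomial proportion, whose standard deviation is largest at p = .5; that simplification ignores the two-stage design and is an assumption made here for illustration. The names are illustrative.

```python
import math


def max_sd_of_proportion(n):
    """Worst-case (p = .5) standard deviation of a sample proportion."""
    return math.sqrt(0.25 / n)


# Base sample: 231 centers, six sex-by-ability cells, two candidates per cell.
centers, cells, per_cell = 231, 6, 2
base_sample_size = centers * cells * per_cell

sd_at_minimum = max_sd_of_proportion(2500)            # the stated floor
sd_at_base_sample = max_sd_of_proportion(base_sample_size)
```

At n = 2,500 the worst-case standard deviation sits right at the .01 bound, and the base sample of 231 x 6 x 2 candidates brings it slightly below that bound.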

Six samples of size approximately 1,000 were to be selected for 
computing efficiency. Of these, two were to consist entirely of females, 
two entirely of males, and the remaining two balanced. Sample 2 (female) 
and sample 4 (male) were selected by sampling 112 of the 186 test centers 
having at least three candidates per cell. This sampling of test centers 
was carried out using proportional allocation over the four geographic 
regions. Sampling within test center was accomplished using simple random 
sampling as before. The resulting females were termed sample 2; the 
resulting males, sample 4. 



Samples 3 (female) and 5 (male) were obtained as were samples 2 and 
4, only centers having at least four per cell were selected. A total of 
85 out of the 151 possible test centers were so selected. 

Samples 6 and 7 were mixed with equal numbers of males and females 
selected from the five and six per cell centers. A total of [?] of 130 
five per cell centers were selected, while 28 of 109 six per cell centers 
were selected. 

Results 

The GRE Aptitude Test is a moderately speeded test (Swineford, 19[?]8), 
which creates a number of problems in determining empirical weights. The 
problem usually occurring is that the omit score becomes extremely large 
and negative, and the validity of the test is reduced (see Reilly & Jackson, 
1972) even though alpha is increased. In examining a preliminary item 
analysis of the special test section, it became apparent that the special 
test was also speeded. In fact, only about 17% of the examinees finished the 
30 items! 

In order to reduce the effect of speed it was decided to eliminate 
some of the items from the special test for the analyses. A response rate 
of [?]% for the entire test was felt necessary. By examining the item 
analysis, it was determined that 92% of the examinees finished the first 
18 items. Thus, only the first 18 items of the special test were scored. 

The resulting values of coefficient alpha for the scoring systems 
under study on each replication appear in Table 1. As can be seen, the 
maximum coefficient alpha is obtained when empirical weights are used. 
This is not unexpected since the empirical weights tend to maximize 
coefficient alpha. What is important is that these values are substantially 
higher for empirical weights, equivalent to increases of 31.6%, 32.5%, 
30.8%, 35.7%, 34.9%, and 32.5% in the test length when formula scoring is 
used. On the other hand, the a priori and Z1 and Z2 systems did not 
equal the performance of formula scoring. 

Insert Table 1 about here 
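
The test-length equivalents can be approximated from the alpha values in Table 1. The sketch below assumes the Spearman-Brown relation between test length and reliability; the text does not say how the percentages were computed, so this is an illustration rather than a reproduction of the original calculation. Because the tabled alphas are rounded to three places, the results agree with the reported percentages only to within about a point.

```python
def length_factor(alpha_old, alpha_new):
    """Factor by which a test with reliability alpha_old must be
    lengthened to reach alpha_new (Spearman-Brown, solved for length)."""
    return (alpha_new * (1 - alpha_old)) / (alpha_old * (1 - alpha_new))


# Alpha values from Table 1, keyed by sample number.
FORMULA_ALPHA = {2: 0.788, 3: 0.806, 4: 0.823, 5: 0.814, 6: 0.809, 7: 0.810}
EMPIRICAL_ALPHA = {2: 0.830, 3: 0.847, 4: 0.859, 5: 0.856, 6: 0.851, 7: 0.850}

# Percentage increase in the length of the formula-scored test needed
# to match the alpha obtained with empirical weights.
percent_increase = {
    sample: 100 * (length_factor(FORMULA_ALPHA[sample], EMPIRICAL_ALPHA[sample]) - 1)
    for sample in FORMULA_ALPHA
}
```

For sample 2, for example, the factor comes out a little above 1.31, that is, an increase in test length of roughly 31%.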

The correlations with the main section quantitative score closely 
resemble the results of the coefficient alpha calculations at least in 
pattern. These correlations are presented in Table 2. 

Insert Table 2 about here 

These results are in contrast to those found by Reilly and Jackson 
(1972), where a decrease in validity was found. It should be pointed out 
that they used undergraduate grades as a criterion measure rather than 
another test as was done in this study, so that the findings of these two 
studies are not contradictory, but rather illustrate different choices of 
criteria. 

A few further points regarding the conduct of this research should 
be pointed out so that the conclusions resulting from this study can be 
taken in proper context. One key area that has been ignored up until this 
time is the conditions under which the experimental test was taken. The 
examinees were given instructions for the usual formula scored test sections. 
It was not possible to use directions specifically designed for a priori 
option weighting (saying that the respondent could receive some form of 
"partial credit" for wrong answers) because it was believed that by 
introducing such directions, examinees would recognize the section as 
being experimental and be less likely to respond in earnest. It was also 
believed that such directions would increase administrative costs. The 
result of using these directions certainly contributed to the poor showing 
of a priori option weighting. 

Another related point is that directing the Test Development Division 
to construct distractors of differential quality may have given empirical 
keying an edge over the conventional method. Certainly the variances of the 
option weights were highly stable given the sample size and method of item 
construction (see Echternacht, 1973). 

Cost 

The cost of constructing a priori items was significantly higher than 
that for the traditional items. In general, the cost of constructing the 
a priori items ran about 60% greater. Thus, for the a priori method to 
prove cost-effective, an increase in reliability would have to be obtained 
that would allow the 18-item test to be reduced to an 11-item test. Such 
an increase in reliability was not noted in this study. 
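
The break-even argument can be made concrete. The sketch below divides the 18 items by the cost ratio of 1.6 to get the affordable test length, and then, assuming the Spearman-Brown relation (an assumption introduced here, not stated in the text), asks what alpha an 18-item a priori test would need so that the shortened test still matches a given formula-scored alpha. The baseline of .810 is the formula-score value for sample 7 in Table 1, used only as an example.

```python
def break_even_items(n_items, cost_ratio):
    """Number of costlier items affordable for the price of n_items."""
    return n_items / cost_ratio


def required_alpha(baseline_alpha, cost_ratio):
    """Alpha the full-length a priori test must reach so that, after
    shortening by 1/cost_ratio, it still equals baseline_alpha
    (Spearman-Brown with length factor cost_ratio)."""
    return (cost_ratio * baseline_alpha) / (1 + (cost_ratio - 1) * baseline_alpha)


affordable_length = break_even_items(18, 1.6)   # 11.25, about 11 items
needed = required_alpha(0.810, 1.6)             # target alpha at 18 items
```

Under these assumptions the 18-item a priori test would need an alpha in the neighborhood of .87 to break even, well above the a priori values actually observed.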

Conclusions 

It becomes obvious that, in this case, the a priori option weighting 
was inferior to that of empirical option weighting. In fact, a priori 
option weighting did not even measure up to traditional formula scoring with 
respect to reliability on the items. Thus it appears that by using only 
empirical option weights, one can cut the cost of developing items (reduce 
the length of the test) and maintain standards of reliability, at least in 
the case of the GRE Aptitude Test. 

There are still many details that need to be worked out before such a 
procedure can become operational. For example, how do you explain the 
scoring to an examinee? What should his strategy be? And also, what can or 
should be done with an item where a wrong answer receives more weight than 
the correct answer (one such item appeared in the special section)? 

Table 1 

Values of Alpha for Each Scoring Scheme and Sample 

                                Sample 
Scoring            2      3      4      5      6      7 

Number Correct   .800   .820   .835   .828   .824   .823 
Formula Score    .788   .806   .823   .814   .809   .810 
A priori         .773   .791   .807   .799   .798   .797 
Empirical        .830   .847   .859   .856   .851   .850 
Z1               .779   .797   .815   .805   .800   .801 
Z2               .774   .792   .809   .799   .795   .796 

Table 2 

Correlations between the Experimental and Regular Quantitative 
Test Sections for Each Scoring Scheme and Sample 

                                Sample 
Scoring            2      3      4      5      6      7 

Number Correct   .850   .846   .859   .847   .843   .848 
Formula Score    .854   .847   .855   .845   .844   .844 
A priori         .840   .841   .849   .838   .833   .838 
Empirical        .862   .864   .873   .859   .849   .858 
Z1               .851   .841   .849   .840   .840   .838 
Z2               .847   .837   .846   .835   .838   .835 

